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<D " Abstract 



C/3 

We study the following problem of subset selection for matrices: given a matrix X £ R" x " 1 (jn > 
^vq | n) and a sampling parameter k (n < k < to), select a subset of k columns from X such that the 

pseudo-inverse of the sampled matrix has as smallest norm as possible. In this work, we focus on the 
Frobenius and the spectral matrix norms. We describe several novel (deterministic and randomized) 
approximation algorithms for this problem with approximation bounds that are optimal up to constant 
factors. Additionally, we show that the combinatorial problem of finding a low-stretch spanning tree in 
an undirected graph corresponds to subset selection, and discuss various implications of this reduction. 
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Given a full rank short-and-fat matrix X £ M. nxm with m > n (typically m > n) it is often of interest 
to compress X via selecting a subset of its columns. The goal of such a sampling procedure is to select 
the columns in a way that the sampled matrix behaves spectrally similarly to the original matrix, i.e. the 
singular values of the two matrices are comparable. Since deleting columns from X decreases the singular 
values monotonically (this is immediate from the interlacing property of the singular values; see Theorem 
8.1.7 on page 396 in [29]), the challenge is to select the columns that maximize the spectrum in the sampled 
matrix. One can naturally formulate this objective into the following combinatorial optimization problem 
(let [m] = {i £ N : i < m}, i.e. the set of natural numbers 1, 2, ...m). 
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Problem 1.1 (Subset Selection for Matrices). Fix X £ W ixm with m > n and a sampling parameter k 
with n < k < m. Let S C [m] denote a set of cardinality at most k and X5 £ ]R nx l 5 l contain the columns 
o/X indicated in S. Among all such possible choices of S, find an S op t such that, 

S op t £ argmin ||X^||g. 
s 

Note that there might be more than one possibility for S op t (the minimizer might not be unique). In 
the above, £ = 2, F denotes the spectral or the Frobenius matrix norm, respectively, and X^ denotes the 
Moore-P enrose pseudo-inverse 0/X5. 

Technically, the above definition corresponds to two different combinatorial optimization problems, one 
for £ = 2 and the other for £ = F. 

Problem 1.1 occurs in numerous applications in combinatorial problems involving sampling: column- 
based low-rank matrix approximation [10, 12]; feature selection in A;-means clustering [11, 13]; optimal 
experiment design [20, 21]; multipoint boundary value problems [20, 21]; sparse solutions to least-squares 
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regression [16, 8]; sensor selection in a wireless network [39]; rank-deficient linear least squares [27], and 
rank-deficient non-linear least squares [38], to name just a few. We discuss some of these applications in 
Section 6. 

However, our initial motivation for investigating Problem 1.1 was our observation that the combinato- 
rial problem of finding a low-stretch spanning tree in an undirected graph [2] corresponds to the Frobenius 
norm version of Problem 1.1. This connection is new and might be of independent interest. 

We study three aspects of Problem 1.1: algorithms, lower bounds, and applications. We now summarize 
our contributions in each of these aspects. 

Algorithms. In Section 3 we describe five different approximation algorithms for Problem 1.1. We 
suggest five different algorithms because no single algorithm has the lowest operation count; the choice of 
the most efficient algorithm depends on the actual values of m, n and k. Our algorithms are considerably 
faster than previously known algorithms, and they achieve the same or tighter approximation bounds. 
Table 1 summarizes the algorithms we propose, as well as previously known algorithms for Problem 1.1. 

Our first two algorithms, Algorithm 1 and Algorithm 2, which we describe in Theorem 3.1 and Corol- 
lary 3.3 (both in Section 3), respectively, are especially fast when k is close to m, since they form S by 
greedily removing columns. Both algorithms are deterministic. Algorithm 1 in Theorem 3.1 is designed 
for the Frobenius norm case (£ = F). It requires O ymn 2 + mn(m — k)) operations, and finds a subset S 
of cardinality k such that 

||xt||g< "- w + 1 .||xt||f.. 

11 5IIF - k-n + 1 11 " F 

Notice, for example, that if A; = m — a, for some small integer < a < 0.9(m — n + 1), then the 
approximation bound is 1 + 10a (m — n + l) -1 . 

Algorithm 2 in Corollary 3.3 is designed for the spectral norm case (£ = 2). It's operation count is 
O (mn 2 + mn(m — fc)) as well. It finds a subset S of cardinality k such that 

( 1 + ^).||x. K . 

Similarly, if, for example, k = n + 1 + /?, for some integer (3 close to m with < /3 < m — n + 1, then the 
approximation bound is 1 + n + n(m — n — l)/3 . 

The idea of greedily removing columns was previously used by de Hoog and R. Mattheijb in [20]. How- 
ever, our algorithms are at least a factor of n faster, and in some cases a factor of n 2 faster. Furthermore, 
our algorithms operate on a wider range of matrices: the algorithms in [20] require that all possible column 
subsets in X of size k or larger are non-singular, while our algorithms have no such restriction (see the 
paragraph Greedy Algorithms in Section 1.1 for a detailed discussion of the results in [20]). 

Our third algorithm, which we describe as Algorithm 3 in Theorem 3.7 (Section 3), is designed for cases 
that k is small, e.g. k = 0(n), a case which is common in applications (see Section 6). The algorithm's 
operation count is O [mn 2 + kn 2 m), and it constructs a subset S with cardinality at most k > n, such 
that, for both £ = 2, F: 

This algorithm is inspired by recent results on approximate decompositions of the identity [4, 10]. Notice 
that, for example, if k = O(n), the approximation bound is 1 + 0(m/k). 

Our fourth algorithm, which we describe as Algorithm 4 in Theorem 3.7 (Section 3), is designed for 
cases where both m and k are large (specifically, k = fi(nlogn)). It is especially fast since it is based on 
randomly sampling columns of the matrix. However, we do not use uniform sampling, so our bounds are 
independent of numerical properties (like coherence) of the matrix. The operation count of the algorithm is 
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Table 1: Summary of various algorithms for Problem 1.1 (our algorithms, as well as previous algorithms). 
X G j^ nxm j s a f u n rank matrix, fe denotes the number of sampled columns. S C [m] has cardinality fc. 
5 denotes a failure probability, which is assumed to be zero if it is omitted from the description. In the 
fourth line of the table, / > 1 is a parameter which trades accuracy with number of operations. In the 
fifth line of the table, r denotes the coherence of X: r = — maxj g r m i (VV T )ii, where V G R mxn contains 
the right singular vectors of X. Lemma 1 in [28] assumes that X has orthonormal rows; but, this can be 
extended to arbitrary X just by applying the result to V T . In the sixth line of the table, the formulation 
in [39] assumes that X is orthonormal, but this can be generalized as well. In the last line of the table, 
rj > is a parameter which trades accuracy with operations. 
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Table 2: Summary of lower bounds for Problem 1.1. By lower bounds, we mean that there is a matrix 
X G R" xm such that for every S, ||X^||| > 7||X^|||, for a value of 7 shown in the table. S C [m] has 
cardinality at most n < k < m. For £ = 2 and n = k, the bound is from Lemma 2.2 in [30]; the bound for 
£ = F is an immediate corollary. We prove the other bounds in Section 4. 
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0(mn 2 + k log k). For a fixed probability parameter 5 (0 < 6 < 1), and A; > [" 32n ln(2n/<5) ] , the algorithm 
constructs a subset 5 of cardinality at most k, such that, for both £ = 2, F, and with probability 1 — <5, 

||X^||| <4-m- HX^lf. 

If X has orthonormal rows, then, the operation count drops to 0(mn + klogk), i.e. linear in the size of 
the input. The analysis of the algorithm is based on the matrix concentration bound of [47]. 

Our last algorithm, Algorithm 5 in Theorem 3.11 (Section 3.4) is designed for k = n. It is based on 
the following theoretical contribution (Lemma 3.9 in Section 3.4): if we randomly sample a subset S of 
cardinality k > n with probability proportional to det(XsX^), then, 
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Algorithm 5 finds a subset S of cardinality k = n such that 

K 1 !!! < ||X 5 1 ||| < (1 + t/) • (m- n + 1) • HX+lll < (1 + rj) ■ (m- n + 1) ■ n ■ ||X+[|| , 

for any 77 > chosen by the user. This bound is deterministic but the bound on the number of operations 
is probabilistic. Specifically, for any < 5 < 1, we show that with probability 1 — 5, the operation count 
is O (mn 3 log <5 _1 log -1 (1 + 77)) . 

Our volume-sampling-based algorithms for the subset selection problem can be viewed as complemen- 
tary results to the volume-sampling-based algorithms designed before for low-rank matrix approxima- 
tion [23]. In low-rank matrix approximation, the subspace spanned by the columns that are selected by 
volume sampling contains a rank k matrix that approximates the best rank k matrix computed via the 
SVD; in our case, the objective is different but we show that volume sampling gives useful results as well. 



Lower Bounds. By lower bounds, we mean that there exists a matrix X 6 ^ nxm sucn that for every 
S of cardinality k > n, for £ = 2 or £ = F, we have ||Xtj||! > 7||X t ||| for some value of 7 which we 
call lower bound. We develop such lower bounds via, first, relating the subset selection problem to the 
so-called column-based matrix reconstruction problem [10], and then, employing existing lower bounds [10] 
for column-based matrix reconstruction. We present these results in Section 4; a summary of lower bounds 
appears in Table 2. Our lower bounds indicate that some upper bounds of de Hoog and Mattheij [20, 21] 
as well as ours are the best possible up to constant factors. This resolves an open question in [20, 21]. 

An alternative way to study optimality is to develop lower bounds of the form ||X^||| > 7||X^ |||. 
However, we were unable to prove such bounds for our algorithms, so we leave this as an interesting open 
question for future investigation. 



Applications. In Section 5, we study the connection between low-stretch spanning trees and subset 
selection. Using a result by Spielman and Woo [50], we prove that the stretch of any tree in an undirected 
graph equals the Frobenius norm squared of the pseudo-inverse of the sampled matrix that arises by 
sampling columns from an orthonormal matrix which is a basis for the row space of the so-called node- 
by-edge incidence matrix of the graph. This incidence matrix contains as many columns as edges in the 
graph; so, sampling columns from this matrix corresponds to sampling edges from the graph. We then use 
this reduction to develop novel algorithms for constructing spanning trees with low stretch in undirected 
graphs. Unfortunately, our algorithms are worse than the available state-of-the-art [26, 1, 41]. We believe, 
however, that the connection is interesting and might be useful to shed new light on the combinatorial 
problem of finding a low stretch spanning tree in an undirected graph. 

In Section 6 we use the subset selection algorithms of this paper to design novel algorithms for three 
other problems involving sub-sampling: column-based low-rank matrix reconstruction, sparse solution of 
least-squares problems, and feature selection in fe-means clustering. 
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1.1 Related Work 



Greedy Algorithms. In [20] de Hoog and Mattheij propose the following algorithm for the Frobenius 
norm version of Problem 1.1. The idea is to proceed by removing one column from X at a time. In the 
first iteration of the algorithm, they remove the column with index i\, where 



i\ = arg min Tr ( (XX — x^Xj 
i=l,...,m 



T \ 



Let Xi G R nx ( m !) be the matrix obtained after removing the zith column of X. In the second iteration 
of the algorithm, they remove the column with index %2 such that, 

i 2 = arg min Tr ( (XiX^ - XjX?) 1 ) , 

t=l,...,m—l \ / 

and so on, until m — k columns are removed. 

A straightforward implementation of this idea requires 0(mn 5 (m — k)) operations. However, one 
can use the Sherman-Morrison formula for rank one updates to the inverse of a matrix and improve the 
operation count to 0(n 3 + mn 2 (m — k)). 

Notice that the algorithm just described assumes (implicitly) that in all the iterations, removing a 
single column does not result in a rank deficient matrix; otherwise, for an iterate Xj (j = 1, m — k) and 
a column Xj (i = 1, m) whose removal will result in a rank deficient matrix, the inverse of XjXj — XjX?" 
is not defined. In [20] it is shown that this algorithm achieves the bound 

iixyi<T~ n+1 -iixt[||. 

11 s" F - k-n + 1 

However, the assumption just mentioned is not true in general. 

Our algorithms of Theorem 3.1 and Corollary 3.3 use the greedy removal idea as well. However, they 
find the columns to be removed in a different way. Our algorithms are substantially faster (at least a 
factor of n, and a factor of n 2 in some cases) than the algorithm of [20]. Additionally, our algorithms 
efficiently detect columns whose removal results in a rank deficient matrix, and avoid removing them. So, 
our algorithms work for any full-rank matrix X, without any restriction. 

We also mention that Theorem 1 in [21] describes a similar greedy deterministic algorithm with com- 
parable operation count but slightly worse approximation bounds than the algorithm of [20] (see Table 1 
for the precise statement of these results). On the positive side, this algorithm works for any X. 



Rank Revealing Factorizations. The subset selection problem that we study in this paper has deep 
connections, which we do not explain in detail, with the so-called Rank-Revealing QR [32] (and also 
see [12, 17] for a summary of available RRQR algorithms) and Rank-Revealing LU [33] factorizations. 

Worth special mention is the seminal work of Gu and Eisenstant [32] on Strong Rank-Revealing QR 
(RRQR). Algorithm 4 and Theorem 3.2 of [32] are presented for matrices X £ ]^ nxm with n > m. They 
can be easily adapted to the case where n < m (see Lemma 15 of [9] or Equation 3.1 of [14]). In this case, 
they can find a subset S of cardinality k < n with some bounds on all the non-zero singular values of X,5 . 
When applied to bound the spectral norm with k = n we have the following bound, 

||X+||1 < (l + / 2 n(m-n)) • ||Xt|||. 

For / > 1 and k = n, the operation count of this method is O (mn 2 log^m). RRQR approaches can only 
be used to sample k < n columns; extending these approaches to sample arbitrary k > n columns, which 
is the focus of this paper, is not obvious. 
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Incoherent Subset Selection. A recent result by Gittens [28] studies the subset selection problem in 
the context of the so-called coherence of X. The algorithm uses random sampling of columns. Gittens 
shows that this simple algorithm gives competitive bounds for incoherent matrices. 

Approximation via Convex Relaxation. Joshi and Boyd [39] explored the use of convex relaxation 
to solve Problem 1.1: initially, X5 is allowed to contain m columns of X, rescaled by weights in the interval 
[0, 1]. This is a convex program, which can be solved, for example, via an interior point algorithm. To 
get a feasible solution for X5, one needs an efficient rounding scheme to get strictly or 1 weights. No 
theoretical results are reported but the method is shown to be efficient in practice. 

Maximum-volume Subsets. Theorem 1 in de Hoog and Mattheij [20] shows that, for k > n, if X5 

maximizes det (Xf X r ) among all possible subsets T of cardinality k, then the following two bounds hold, 

||X^||1 < (l+n(m- fc)/(fc-n + l))-||X t |||; and, ||Xjy||f < (m - n + 1) / (k - n + 1) • n ■ \\X?\\l. Similar 
spectral norm bounds for k = n were shown before in Eqn. 2.4 of Theorem 2.2 of [36], Lemma 3.4 (p, = 1) 
in [45], Lemma 2.1 of [30], and Algorithm 3 in [32]. A similar Frobenius norm bound for k = n was shown 
before in Eqn. 2.13 of Theorem 2.3 of [36]. Notice that all these results do not imply any algorithm other 
than the naive procedure of testing all the (™) possible subsets of cardinality k (this procedure has an 
exponential operation count). 

Our Lemma 3.9 proves a similar result, which states that if one samples S with probability proportional 
to det (xJXs) , then, the same bounds hold in expectancy. Now, recent polynomial-time implementations 
(0(mn 3 ) operations) of such determinant-based random sampling [22, 34] allow us to design efficient 
algorithms (0(mn 3 ) operations with high probability) that achieve only slightly larger bounds. 

Finally, note that the strong RRQR algorithm of [32] can also find a local maximum- volume subset. 
By local maximum- volume subset, we mean that the volume of the subset found is always bigger than the 
volume of any subset obtained by interchanging a single column. 

Computational Complexity of Subset Selection. In [19], Civril and Magdon-Ismail study the spec- 
tral norm version of Problem 1.1, as well as three other similar subset selection problems, from a complexity 
theory point of view. They show that these problems are NP-hard. They give special emphasis to the 
problem of finding a subset S for which X5 has maximum volume, i.e. det(XsXj) is maximized. As we 
discussed above, the problem of finding the subset with maximum volume is connected to Problem 1.1. 

The computational complexity of finding a maximum volume subset was also investigated in the com- 
putational geometry literature. The problem is stated differently: finding a large simplex in a V-polytope. 
NP-hardness was established in [44], and exponential inapproximability was established in [40]. 

Variants of the Subset Selection Problem. Other variants of subset selection have been studied 
extensively in numerical linear algebra and computer science. Most of this work focused on spectral 
norm and the case of X5 containing rescaled columns from X (Theorem 3.1 in [47]; Theorem 11 in [57]; 
Theorem 3.1 in [4]) or X5 containing linear combinations of columns from X (Lemma 3.15 of [43]; Lemma 
6 of [48]; Theorem 1.3 of [53]). We should note that all these results give much better approximation 
bounds than our bounds for the spectral norm version of Problem 1.1. For example, the deterministic 
algorithm in Theorem 3.1 in [4], for any e > 0, selects and appropriately rescales 0(n/e 2 ) columns from 
X and guarantees an approximation bound 1 + e. 

We believe that the main reason of why our bounds are worse than the bounds in these two variants 
of the problem is the absence of rescaling. Selecting actual columns from X keeps less information from 
the original matrix than selecting rescaled columns or keeping linear combinations of the columns. In 
our case we keep nk numbers to summarize X. However, when one considers rescaling, she keeps nk + k 
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numbers, i.e. k columns of X and a weight for each column. Taking linear combinations of the columns 
of X essentially means multiplying X with some m x r (e.g. r = 0(n)) matrix W, thus the information 
used to represent X5 is nk + 0{mn). Our lower bounds formally argue that selecting actual columns from 
X can not give as tight bounds as the bounds obtained from the two variants of the problem mentioned 
in this paragraph. 

Finally, we mention that all these algorithms have found many applications in numerous problems 
involving subsampling: least-squares regression [3]; column-based low-rank matrix approximation [12]; 
spectral graph sparsification [52]; and, dimensionality reduction in clustering [13]. 

Restricted Invertibility. Bourgain and Tzafriri restricted invertibility result [7] states that there exists 
a universal constant C such that for every square invertible matrix A G M nxn whose columns have unit 
£2 norm, one can find a subset S C [n] of cardinality at least Cra/||A||2 such that 

||A 5 || 2 • ||A^|| 2 < V3. 

Given A, finding such a subset S is another variant of column subset selection. Tropp gave the first 
polynomial (randomized) algorithm for restricted invertibility [54]. A deterministic algorithm was recently 
suggested by Spielman and Srivastava [49]. However, restricted invertibility deals with selecting fewer than 
n columns that maximize the smallest non-zero singular value, while Problem 1.1 deals with selecting at 
least n columns so that the matrix is full-rank, and the smallest singular value is as large as possible. So, 
the problems are similar, but different. 

2 Preliminaries 

Basic Notation. We use [n] to denote the set {1, . . . ,n}. We use X, Y . . . to denote matrices; x, y . . . 
to denote column vectors. We denote the columns of X € jj"*™ xi,X2, . . . , x m G IR n ; Xj is column i 
of X. I m is the m x m identity matrix; nX m is the n x m matrix of zeros; ej is the ith standard basis 
vector (whose dimensionality will be clear from the context): all entries are zero except the ith entry which 
equals one. Xjj or (X)^ denotes the (i,j)th element of X. Vy denotes the jth element of a vector Vj. 
Logarithms are base two. We abbreviate "independent identically distributed" to "i.i.d". Finally, for a set 
A, we denote by C (A, k) the set of all subsets of A of cardinality k. 

Sampling Columns. In the context of Problem 1.1, S is a set of cardinality 1 < k < m, which contains 
some subset of the natural numbers from 1, 2, m (repetition of numbers is not allowed). X5 contains 
the columns of some matrix X G R nxm , indicated in S; sometimes we will use (X)s to denote the same 
matrix. The columns of X5 are ordered consistently with their order in X: if i, j are elements from S and 
i < j, then, the ith column of X will appear before the jth column of X in X5. Finally, Xj means (Xs) T 
and X^ means (Xs)t. 

Singular Value Decomposition. The Singular Value Decomposition (SVD) of X G R nxm is: 




with singular values a± > . . . a r > o>+i > . . . > a p > 0. Here, p = rank(X) and r is some rank parameter 
1 < r < p. We will often denote o\ as <7 max and a p as a m \ n , and will use Ui (X) to denote the i-th singular 
value of X when the matrix is not clear from the context. The matrices U r G M nxfc and U p _ r G R nx ( / '~ r ) 
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contain the left singular vectors of X; and, similarly, the matrices V r G ^jnxk aQ( ^ \ p _ r g ]g mx (p- r ) 
contain the right singular vectors of X. Finally, we repeatedly use the following column representation for 
the matrix V: V T = Y = [yi,y2, ...,y m ]. Here, the y^'s are vectors in W 1 . 

Moore-Penrose Pseudo-inverse. Let X G R nxm with SVD X = USV T . Then, X* = VS"^ 1 G 
R mxn denotes the Moore-Penrose pseudo- inverse of X (XT 1 is the inverse of S). 

Lemma 2.1 (Fact 6.4.12 in [5]). Let A G IR mxn ,B G R nxe , and assume that rank(A) = rank(B) = n. 
Then, (AB)t = 

Lemma 2.2. Let A G R nxm he a full rank matrix with m > n. Let B be an invertible m x m matrix. 
Then 

||(AB)t|| 2 < ||At|| 2 . IIB" 1 ^. 

Proof. 



(AB)t|| 2 = ( ( T min (AB))- 1 = (a min (B T A T ))- 1 = f 



|B T A T x| 



mm 



\x^0 ||X|| 2 

B T A T x|| 2 ||A T x| 



< min 
\x^o ||A i x|| 2 ||x|| 2 

(*) / ||B T A T x|| 2 ||A T x|| 

< min = • min — - — - — 

\x^o ||A i x|| 2 x^O ||x|| 2 

. ||B T y|| 2 . l|A T x|| 2 ^^ 

< mm — - — • mm — - — — 



v¥=o \\y\\2 x^o ||x|| 2 
(a min (B T ) • a^nCA 7 ))- 1 

2 • 



|At|| 2 . IIB- 1 ! 



In (*) we use the fact that A T is a full rank matrix with more rows than columns (so ||A T x|| 2 ^ for 
x^O). ■ 

Column Exchanges and Cramer's rule. For a matrix A, an index i and a vector v, we denote by 
A(i — > v) the matrix obtained after replacing the ith column of A by v. Notice that for square matrices 
A and B of the same dimension we have det(A) det(B(i — > v)) = det((AB)(i — > Av)). For an invertible 
square matrix A, recall Cramer's rule, which gives a formula for computing the components of x = A _1 b 
in terms of determinants. In our notation, the rule states that Xj, the ith position in x, is 

_ det(A(i -» b)) 
Xi ~ det(A) ■ 

Volume Sampling. Let X be a full rank matrix of dimensions n x m with m > n, and let n < k < m 
be some integer. Given a subset S G C {[m], k) define the probability of S by 

det (X 5 Xj) 



Ps 



EreC(H,fe) det ( x T x r) 



The values {Ps}seC([m],k) define a distribution over the sets in C ([m], k). We denote this distribution by 
VolSamp(X, k). That is, we write S ~ VolSamp(X, k) to denote that S is a random subset which assumes 
value in C ([m], A;), whose distribution is defined by 

Pr(5 = T) = P T . 
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We call this sampling distribution volume sampling due to the fact that de^X^X^) 1 / 2 is the volume 
of the parallelpiped defined by the rows of X5. Notice that if A is a square non-singular matrix then 
VolSamp(AX, k) = VolSamp(X, k) (this follows from the fact that det((AX) s (AX)|) = det(A) 2 det(X 5 x£), 
for every S). 

An efficient algorithm for sampling a set from VolSamp(X, n) was first suggested by Deshpande and 
Rademacher [22]. This algorithm was recently improved by Guruswami and Sinop, who showed how to 
sample such a subset with 0(n 3 m) operations [34]. There is currently no algorithm for sampling from 
VolSamp(X, k) for an arbitrary k > n. 



Other Known Results. In addition we use the following two known results. 

Lemma 2.3 (Special case of the Cauchy-Binet formula). Let A G M nxm ,B G M. nxm , and m>n. Then, 

det(AB T )= det(As) det(B|) . 

SeC([m],n) 

Lemma 2.4 (Theorem 1.2.12 in [37]). Let Ai > A2 > ... > A m , denote the eigenvalues of A £ ]R mxm . Let 
1 < k < m. Then, 

det(A 5 , s )= Yl U Xi 

SeC([m],k) SeC([m],k)ieS 

Here, A^s G M, kxk denotes the submatrix of A corresponding to the rows and the columns in S C \m\, 
which has cardinality k. 

Lemma 2.4 expresses the fc-th elementary symmetric function of the eigenvalues of A as the sum of 
the k-hy-k principal minors of A. 



3 Algorithms 

3.1 Deterministic Greedy Removal 

The algorithm of this section uses the same greedy removal strategy as in [20], but it is faster, since 
it exploits the SVD decomposition of the matrix and the ability to quickly update it. Additionally, our 
algorithm efficiently detects columns whose removal results in a rank deficient matrix, and avoids removing 
them (see the discussion in Section 1.1). We prove that our algorithm achieves the same approximation 
bounds as in [20]. The proof of [20] does not apply to our algorithm, since [20] assumes implicitly that in 
all the iterations, removing a single column does not result in a rank deficient matrix. 

Theorem 3.1. Fix X G W ixm (m > n, rank(X) = n) and sampling parameter m > k > n. Algorithm 1 
needs O (mn 2 + mn (m — A:)) operations and deterministically constructs a set S C [?n] of cardinality k 
with 

t 2 m-n + 1 f 2 nvtn2 ^ m-n + 1 , t||2 
a, \\-p < ■ X 1 w and XM U < ■ n ■ X 1 9 . 

Moreover, i/X contains orthonormal rows, the operation count is O (mn(m — k)). 

Before proceeding with the proof we state an auxiliary lemma. However, we defer the proof to Sec- 
tion 3.4 since this Lemma is a corollary of a Theorem that appears in that section. 

Lemma 3.2. Let X G flj nxm ^ m > n) be a full rank matrix. There exists a subset S C [m] of cardinality 
m-l such that X 5 is full rank and ||X^||| < m r ^"+ 1 • HX^H . 
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Input: X G M. nxm , sampling parameter n < k < m. 
Output: Set S C [m] of cardinality k. 
l: S [m] 

Compute the SVD of X 5o : X 5o = U (0) £ W Y W 
for 2 = 1, 2, . . . , m — k do 



Let the singular values of X < 5 ! _ 1 be crj* ^, . . . , <7n . 



2 
3 

4 

5: Let the columns of Y^" 1 ^ be {yr } r eSi-i- 
Denote by yj* ^ the /-th element of y£ l ^ . 

6: ji <- argmin re5! i;||yri)||2<1 ^ ^jj^jji J ■ 

{See proof on how to implement this step in a stable manner.} 
7: - {ji} 

8: Downdate the SVD of X 5i _ x to obtain an SVD of X 5 . = U^E^Y^. 

{Using an algorithm described in [31]} 
9: end for 
10: return 5 



Algorithm 1: A deterministic greedy removal algorithm for subset selection (Theorem 3. 



Proof of Theorem 3.1. The spectral norm bound is immediate from the Frobenius norm bound using the 
fact that for any matrix B, ||B||2 < ||B||p < rank(B) • ||B||2. So, we prove only the Frobenius norm bound. 

Let X = U£!V T = USY be an SVD of X (see also Section 2 for useful notation). First, we argue 
that XX T — XjxJ is singular if and only if ||yj||2 = 1- Suppose rank(XX T — XjX?") < n. Notice that, 
XX T — XjxJ = U5] (l n — y^y^) SU T . The matrix U is full rank (it is square unitary), so we find that 
S (l n — yiyj) £ is singular. We now observe that S is full rank (it is diagonal with positive values on 
the diagonal) as well, so we find that I n — y^y^ is singular. That can hold only if ||yj||2 = 1- Therefore, 
comparing the norm of y^ with 1 is an efficient way (once we have an SVD) under exact arithmetic to 
detect if XX T — x^xj is singular. 

We proceed with some calculations. Let a%, o~2, ■ ■ ■ , & n denote the singular values of X. Fix an index 
i. If XX T — XjxJ is not singular, then, 

Tr^XXT-x^)- 1 ) ® ^((U^U^U^SUy 1 ) 

(3 ^fVa+S-WE-- 



i-y/y 

® Tr(£- 2 )+T/ S ~ ly ^ S ~ 



i-y 4 T y< 

= ||xt||| + ^_.Tr(i]-V i (^ 1 yO T 
( e ) „ v t„2 , ll^^Villl 



X 'iF^ l_|| v .||2 

1 ny«ii2 

2 



(£ ,| X t||2 , Ejf I <J J ) 



IF " 1- 11*111 



(a) follows by replacing the SVD of X. (6) follows from the Sherman-Morrison formula (recall that we 
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assume that XX T — XjX? is not singular, so ||yi||2 7^ 1) and the identity U T U = I n , which also implies 
that Tr(UAU T ) = Tr(A). (c) follows by the linearity of the trace operator, (d) follows from the fact that 
1/(1 — Hyilll) is a scalar, (e) follows from the fact that for any matrix B we have Tr (BB T ) = ||B||p; in 
our case, we apply this equality to B = S _1 yj. (/) follows because E is diagonal. 



Algorithm. To facilitate the description of the algorithm, we assume the columns in X5 are indexed 
using their index in X. Our algorithm constructs S by iteratively removing columns. That is, we start 
with a complete subset Sq = [m]. Then we proceed with m — k iterations, since the goal is to select k 
columns. We reserve the index i to refer to the iterations of the algorithm; so, i = 1,2, ... ,m. 

Each iteration i of the algorithm starts with some C [m] of cardinality m — i + 1, and removes 
one index from it, to obtain 5, C of cardinality m — i. Our algorithm does exactly m — k iterations, 
hence it returns 5 m _fc of cardinality k. Additionally, in each iteration we maintain an SVD of the current 
subset, X^ = U^E^Y^. We denote the singular values of X^ by . . . ,(Jn\ and the columns of 
Y« by {y®} reSi . 

Our algorithm begins by computing the SVD of X^ = X. Then, for i = 1, 2, . . . , m — k, iteration i 
has two stages: 

1. Finding an index ji to remove. We then set Si = — {ji}. 

2. Updating the SVD X 5i = U (i) S w Y (i) . The algorithm needs only S (i) and Y w (no need to downdate 
UW). 

We now describe each stage in detail. Our algorithm implements the greedy removal idea of Theorem 
2 in [20], so ji is selected as to minimize ||xjj.||p (subject to constraint that Si is a subset of 
Specifically, the formula for ji is 



arg mm 

r , G<S I _i;||y^ 1) || 2 <l 



v 



1- llv^ll 2 
1 \\yr II2 



In the last equation, yf ^ is the (l,r) element of Y^ ^ or, equivalently, the I element of y^ ^ . Equa- 
tion (1) guarantees that ji minimizes [[X^. ||p. 

As for the second stage, we simply downdate the SVD of X^ j to obtain an SVD of X^, using the 
algorithm described in [31]. 



Stability issues. Equation (3.1) is unstable since it can potentially suffer from catastrophic cancellations 
when ||yr [[2 ~ 1. However, to find the minimizer we need to do only comparisons. That is, we need to 
be able to determine for two indices g,h G Si-\ whether 



el: ^y^^ l, /-^- l, ) 2 el, (ygr'W-" 



< 



1- llv^ll 2 1- llv^ll 2 

1 \\yg 1 1 2 1 WYh 1 1 2 

or not. It is easy to verify that provided \\yg ^[^ < 1 and ||y^ ^[^ < 1, the last equation holds if and 
only if 

71 71 71 71 

E (yrV^T+liyf "'"IE (yt VI- 11 ) 2 s E (y!r Vr^+ilyr'lUE WW"")' 

1=1 1=1 1=1 1=1 
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Input: X £ M. nxrn , sampling parameter n < k < m. 
Output: Set S C [m] of cardinality k. 

1: Compute the matrix V € ]^ mxn Q f the top n right singular vectors of X. 

2: Run Algorithm 1 with inputs V T and k to obtain S of cardinality k. 

3: return S 



Algorithm 2: A deterministic greedy removal algorithm for subset selection (Corollary 3.3). 



The last equation does not do any substraction, so it does not suffer from catastrophic cancellations. 

Another issue with equation (3.1) is that an index h £ Si-i is a candidate minimizer only if ||y^ ||2 < 
1. Under inexact arithmetic that will always be the case, even if removing the column results in a rank 
deficient system. This issue can be solved by replacing the test ||yl, \\2 < 1 with ||yi, \\2 < 1 — r for 
some small threshold r. 



Approximation bound. Recall that the spectral norm bound follows immediately from the Frobenius 
norm bound. We now focus on proving the Frobenius norm bound. 
We will show using induction that 

t 2 < m-n+l f 2 

11 S ' UF ~ m-i-n+l 11 llF ' K ' 

Since our algorithm returns S m -k the claim follows from (1). 

Equation (1) trivially holds for i = 0. Assume it holds for i — We now show it holds for i+. Note that 
the cardinality of is m — % + 1. Lemma 3.2 ensures that there exists a subset T% C of cardinality 
m — i such that 

t 2 < m-i-n + 2 _ t 2 < m-i-n + 2 _ m-n+l _ f 2 = m-n+l _ f 2 
— m — i — n + 1 5iF — m — i — n + 1 m — i — n + 2 F m — z — n + 1 F 

Our algorithm finds a subset Si C of cardinality m — i with minimal ||xjy. ||p, so 

t 2 [.__ ,|2 m — n+1 | 2 

X c. p ^ Prt F — : TT ' F • 

V m - z - n + 1 

Operation count. First, one needs 0(mn 2 ) operations to compute an SVD of X. Now, at iteration i, 
computing ji requires 0((m — i + l)n) operations. Downdating the SVD to find and Y^^ can be done 
be done in 0((m — i + l)n) operations 1 . There are m — k iterations, so overall, O (ran 2 + ran (m — fe)) 
operations suffice. If X has orthonormal rows, the operation count is just O (mn (m — k)) because the 
initial SVD is available. ■ 

We now describe an algorithm which achieves a slightly worse Frobenius norm bound than the bound 
in the previous theorem but a slightly better spectral norm bound. The set S of the corollary below and 
the one in Theorem 3.1 might be different. 



1 This is precisely Problem 3 in page 794 of [31]; the third paragraph in page 795 of [31] argues that this problem can 
be solved in 0(mn log 2 e) operations, where e is the machine precision. In our analysis we ignore the log 2 t term since e is 
constant, and log 2 e is not too big since typically e ~ 1CP 16 . Ignoring such terms is common in the analysis of SVD-type 
algorithms. 
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Corollary 3.3. Fix X G jj nxm ( m > n ^ rank(X) = and sampling parameter k > n. Algorithm 2 needs 
O (mn 2 + mn (m — k)) operations and deterministically constructs a set S C [m] of cardinality k with 

" 6llt k-n + 1 

Also, for i = 1, . . . ,n we have 
In particular, 

l|x^||<fi + ? (m ~ fc) Vl|xt[|l. 

II 5112 - ^ T- k _ n + 1 J II 112 

Moreover, z/X contains orthonormal rows, the operation count is 0(mn(m — k)). 

Proof. Let X = USV T be an SVD of X with U G R mxm , £ G M mxm , and V G M nxm . The algorithm 
of this corollary consists of applying the algorithm of Theorem 3.1 to the matrix V T from the SVD of 
X. Let S C [m] be the set found by the algorithm, and let S = [m] — S. Corollary 2 of [20] asserts that 
||X^ • X^Hp < n £™~^ ■ Now, Corollary 1 in [20] indicates that, if such a bound holds for ||X^ • X^Hp, then, 

nv-t n2 / m-n + 1 f 2 . 2 / n(m-A;)\ _1 2 

||X 5 || F < _ _ • n • ||Xt|| 2 ; for , = 1, n ^ (X) • ^1 + ^—^ j < a t (X 5 ) . 



3.2 Deterministic Greedy Selection 

The algorithm of this section is an application of the deterministic algorithm presented in [10], which is, 
in turn, a generalization of an algorithm from [4]. In particular, we use Lemma 10 from [10]. 

Lemma 3.4 (Dual Set Spectral Sparsification, Lemma 10 in [10].). Let V = {vi,...,v m } and U = 
{ui,...,u. m } be two equal cardinality decompositions of identity matrices: Vj G W 1 (n < m), Uj G M. e 
(£ < m), ^2iLiVivJ = I n , and Y^h=i u i u J = Given an integer k with n < k < m, there exists an 
algorithm that computes a set of weights Sj > (i = 1, . . . ,m) at most k of which are non-zero, such that 

O n S i V i V ^J - ( X ~ a7ld ai SiUiU 

The algorithm is deterministic and needs at most O (km (n 2 + £ 2 )) operations. Moreover, if the set U 
contains vectors from the standard basis from ffi m , the algorithm needs O (kmn 2 ^ operations. We denote 
the application of the algorithm to V and U by 

[si, s 2 , ...,s m } = DualSet(V,ZY, k). 

We refer the reader to [10] for the full description of the algorithm. Lemma 3.4 implies that one can sample 
from two different set of vectors V = {vi, . . . , v m } and U = {ui, . . . , u m }, and control simultaneously the 
smallest singular value of the matrix formed from the sampled vectors from the first set, and the largest 
singular value of the matrix formed from the sampled vectors from the second set. We now present our 
result. 
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Input: X £ M. nxrn , sampling parameter n < k < m. 
Output: Set S C [m] of cardinality at most k. 



1: Compute the matrix V £ ]^ mxn Q f the top n right singular vectors of X. 

2: Let V = {yi, . . . ,y m } (see Section 2 for the definition of y^'s). 

3: Let U = {ei, . . . , e m } contain the standard basis vectors. 

4: Run [si, s 2 , • • • , s m ] = DualSet(V,W, k) (Lemma 3.4). 

5: return S = {i : Sj ^ 0} 



Algorithm 3: A deterministic greedy selection algorithm for subset selection (Theorem 3.5.) 



Theorem 3.5. Fix X £ M nxm ^m, > n, rank(X) = n) and sampling parameter m > k > n. Algorithm 3 
needs O (^kmn 2 ^ operations and deterministically constructs a set S C [m] of cardinality at most k such 
that for both £ = 2,F: 

»4iii4+y!)Vy!f l|x ^ 

Proof. We first state the algorithm; then, we prove the approximation bound. Finally, we bound the 
number of operations. 

Algorithm. Algorithm 3 proceeds as follows. First, it computes the SVD of X: X = UX!V T (see also 
Section 2 for useful notation).. The second step is to apply the algorithm of Lemma 3.4 on V = {yi, . . . , y m } 
and IA = {ei, . . . , e m }, the standard basis, to compute the weights s\, . . . , s m . The algorithm then returns 
S = {i : Si ^ 0} . 

Approximation Bound. Lemma 3.4 guarantees that 

m \ / / \ 2 



or (EWj<(l + y|) 



However, X)I=i s i e i e 7 = diag(si, . . . , s m ) £ M. mxm , a diagonal matrix containing the weights Sj's in its 
main diagonal; so, maxj Si < (l + y^r) • Lemma 3.4 also guarantees that 

Assume that S = . . . , i^} where k < k and %\ < i% < ■ ■ ■ < i~ k , and let D = diag(y / Sfj", . . . , y / s^")- It 
is easy to verify that YT=i WiY? = Y 5D 2 Y|; so, Y 5 is full rank and ||(Y s D)t||| < (l - ^/f}~ 2 ■ The 
bound maxj Si < (l + y^r) 2 ear her implies that 1 1 H> 1 1 ^ < (l + \/t) 2 ' Now, observe that, 

||X^||| ( = } || (USY 5 ) t ||| ( = } HY^S- 1 ^!!! < ||Yi||l • ||X+||| ( i || (Y 5 DD^) f f 2 ■ ||X+||| 

< ||D||M|(Y 5 D)t||M|xt||| 



(/) 



2 

m\ I In 
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Input: X £ M. nxrn , sampling parameter n < k < m. 
Output: Set S C [m] of cardinality at most k. 

1: Compute the matrix V 6 ]^ mxn Q f the top n right singular vectors of X. 
2: For z = 1,2, ... ,m let (see Section 2 for the definition of y^'s), 

Tj = maxlHyilll, — }. 

m 

3: for t = 1,2, . . . , k do 

4: Pick where it = i with probability Tj. 
5: end for 

6: return 5 = {i±, 12, ■ ■ ■ , it} 



Algorithm 4: A randomized algorithm for subset selection (Theorem 3.7.) 



(a) follows by replacing X with its SVD. (6) follows by using Lemma 2.1 and the fact that all three matrices 
involved are full rank, (c) follows by standard properties of matrix norms, and using the definition of the 
pseudoinverse of X and S. (d) follows by introducing the identity matrix 1^ = DD 1 . (e) follows by using 
Lemma 2.2. Finally, (/) follows from the bounds we just proved for the terms ||D|||, and || (Y^D)^ |||. 

Operation count. The algorithm first computes an SVD of X, which costs O (mn 2 ). The second step 
is to run the algorithm of Lemma 3.4 on the right singular vectors of X and the standard basis, which 
costs O (kmn 2 ). So the total cost is O (kmn 2 ). ■ 

3.3 Randomized Selection 

The main idea in the algorithm of this section is to non-uniformly sample columns from X. The analysis 
is based on a matrix concentration bound from [47]. More specifically, we use Theorem 3.1 from [47] (the 
constants are from Corollary 4 in [55]). 

Lemma 3.6 (Theorem 3.1 in [47]). Let x 6 W 1 be a random vector, which is uniformly bounded almost 
everywhere: ||x||2 < M. Assume, for normalization, that ||E [xx T ] ||2 < 1. Let xi,X2,...,Xfc be k inde- 
pendent copies of x sampled with replacement. Then, for every e S (0,1), and with probability at least 
1 - 2 ■ n ■ e- 2fc / 4M2 : ||£ E*U ^1 ~ E [xx T ] || 2 < e. 

Our algorithm is based on non-uniform sampling with replacement. The sampling probabilities are 
related to the so-called leverage scores of the columns of X [12, 25], but in our algorithm we make sure 
that no column has a sampling probability that is too small. One needs cubic time to compute these 
probabilities using SVD or QR; our algorithm computes the probabilities that way. However, one can 
approximate these probabilities in sub-cubic time using recent results from [24]. It might be the case that 
these results be used to improve the running time of our algorithm, at the cost of some small increase in 
the approximation bound. However, we leave this issue for future research. 

Theorem 3.7. Fix X G E" xm (m > n, rank(X) = n). Choose a probability parameter 5 (0 < 
5 < 1). Now, choose an sampling parameter m > k > min([ 32n ln(2n/<5) ] , m). Algorithm 4 needs 
O (mn 2 + /clog A;) operations and randomly constructs a set S C [m] with cardinality at most k, such that, 
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for both £ = 2, F, and with probability at least 1 — 5, 

||Xly||! < 4-m - || X + 1|| . 
Moreover, z/X contains orthonormal rows the operation count is O (mn + k log A;). 

Proof. We first state the algorithm; then, we prove the approximation bound. Finally, we bound the 
number of operations. 

Algorithm. Algorithm 4 proceeds as follows. First, it computes the SVD of X: X = USV T (see also 
Section 2 for useful notation). Let, 

n = max{||yj|||, — } 
m 

for i = 1, . . . ,m. The set S is formed by non-uniformly, and independently, sampling k numbers from 
1, . . . , m with replacement. In each trial, j is sampled with probability 

m 

Pi = T i/XN- 

i=l 

Approximation bound. The first part of the proof is a technical manipulation to enable us to use 
Lemma 3.6. First, let C\, . . . , be the indices sampled in trials 1, . . . , k. That is S = {c±, . . . , c^}. For 
i = 1, . . . , k define the random vector Xj = y Ci /y/Pc^- Now, for i = 1, . . . , k define Si = ■ #{j : Cj = i}. 
Notice that S = {i : Si ^ 0}. Assume that S = . . . , i^} where k < k and i\ < z 2 < ■ ■ ■ < it, and let 

D = diag(y / s7j", • • • i With these definitions we observe that ^ Yli=i x i x J = YsD 2 Yj. 

Notice that xi, X2, . . . , x^ are i.i.d. Let x denote a random vector from the same distribution of 
xi,X2, . . . ,Xfc. To use Lemma 3.6 we need to compute E [xx T ] and to bound ||x|||: 

m m 

E t xxT ] =^Pi- -T^yi ■ -/=yl = Yl y*y? = r « ; 

t=l v pi v ft i=l 



|| ,|2 {a ) Yj 2 W YT=\ T i I! 112^^ WV- m |,2 n (*) ^ ||2 m 

x 2 < max — — = max y, 2 < ) Tj = > max{ y* 2 , — } < n+ > y, 2 = 2n 

jelm] Pj je[m] Tj ^ ^ m 

(a) follows by replacing the values taken by the vector x. (b) follows by replacing the value for the 
probabilities p^s. (c) follows by the fact that r,- > ||yj|| 2 , for all j = 1, ...,m. (d) follows by replacing the 
value for the parameters Tj's. (e) follows by simple algebra. 

We are now ready to apply Lemma 3.6 for the random vector y described above. An immediate 
application of this Lemma (M = \^2n, e = 1/2) and our bound on k give that with probability at least 
1 — 5, ||YsD 2 Y];— I n || 2 < i. (Recall that jXli=i x i x ?' = YsD 2 Yl;). Standard matrix perturbation theory 
results [29] imply that for i = l,...,n \a\ (Y 5 D)-1| < || Y 5 D 2 Y| - I n || 2 , so, % = n gives, ||(Y 5 D)t|| 2 < 2. 
We bound ||D|| 2 as follows 

||D|| 2 ( = } max D 2 , < k max ( —) < max ( SlZA 

j€[m] JJ je[m] \kpjj je[rn] \ Tj J 



M [f 1 ( 1 

= 7 Tj I • max 



je[m] \max{\\y j\\$,n/m} 



( e ) ( 1 

< 2 • n • max 



je[m] Vmax{||y,- \\\, n/m} 



(/) m 

< 2 • n • — = 2 • m 

n 
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(a) follows because D 2 is a diagonal matrix, (b) follows because each entry in D 2 might contain the term 
1/kpj, at most k times, (c) follows by replacing the values for the probabilities, (d) follows by replacing 
the values of the parameters Tj's. (e) follows by the fact that 5Z£=i r « — ^ n i which we proved in Eqn. (/) 
in the previous calculations. (/) follows by simple algebra. 

To conclude the proof, notice that at the end of Lemma 3.5, we implicitly proved that ||X^|| 2 < 
||D|| 2 • ||(Y 5 D)t||2 . ||Xt|| 2 . Replace the bounds for ||(Y 5 D)t||| and ||D|| 2 in this bound to wrap up. 

Operation count. First, we need to compute an SVD of X, which costs O (run 2 ). The probabilities 
can be calculated in 0(mn) and the sampling procedure can be implemented in 0(m + k log k). In total, 
the cost is O (rnn 2 + fclogfc). If X contains orthonormal rows, 0(mn + /clog A;) operations suffice. ■ 

3.4 Volume based bounds and algorithms 

We now consider bounds and algorithms which construct the set S by looking at the volume of the 
parallelepiped spanned by the rows of X5, which is exactly the determinant of X^X^. 

Subset Selection and Determinants 

We start with Lemma 3.8, which establishes the connection between determinants and subset selection. 

Lemma 3.8. Let X £ R nxm (m > n) be a full rank matrix, and let S C [m] be any subset of cardinality 
k (n < k < m) such that X5 is full rank. For i = 1, . . . ,n, letY~i G M( n_1 ) xm denote the matrix obtained 
after removing the ith row o/X. Then, 

f 2 _ £? = idet((YQ 5 (YOS 
l|X5llF " det (X 5 XJ) • 

Let X = [xi,X2, ...,x m ] be the column representation o/X. If S has cardinality exactly n, then 

IIY-1I.2 ^ |i Y ti|2 ^7=1 TJU (Xs(i -> Xj)) 2 
11X5 " F " " X " 2 d^? ' 

Recall that, X5 (i — > Xj) is the matrix by replacing the i-th column 0/X5 with Xj, the j-th column o/X. 
//X has orthonormal rows then the last inequality is an equality. 

Proof. We first prove the equality in the Lemma (S has cardinality k unless otherwise stated), 
\\XI\\ 2 F = Tr ( (XsXj)- 1 ) ® Tr (det (X^J)" 1 Adj (X 5 Xj)) <J det (X^Xj)" 1 Tr (Adj (X 5 X|)) 



det(X 5 Xj) 1 ^(Adj(X 5 xT)) j 

i=l 

n 

det (X 5 Xj)" 1 ^det((Y,) 5 (Y^ 

i=l 



(a) follows by a property which connects the Frobenius norm of the pseudoinverse with the trace operator. 
(6) follows by the well known formula for the inverse of a matrix using the adjugate matrix, (c) follows by 
the linearity of the trace operator, (d) follows by the definition of the trace operator. Finally, (e) follows 
by the definition of the adjugate matrix and the observation that the ith diagonal element of Adj (X^X^) 
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equals the determinant of an (n — 1) x (n — 1) matrix which is exactly X^X^ after removing its ith row 
and ith column. This matrix is exactly (Yj) 5 (Yj)^. 

We now prove the second inequality (S has now fixed cardinality k). Let X = USV T be an SVD of 
X (see also Section 2 for useful notation). Define Xj = USy,. Then, 



|v-l||2 W Hv-lv-iTTTll 2 <- IIV" 1 !! 2 \W~W\\ 2 Q 
\-*S Wf — II 1 S ^ U llF^II-^ II 2 " II 1 s 1 1 F — 



(J 

(e) 
(/) 



i^iii-eiiy^i 



iS-'lll-^HY^S-VUEy 



J 112 



i — 1 1 1 2 



Ef=i EILi det ((USY S ) (i -+ UE^))' 



1 112 detCUSY.;) 2 
|xt||2 E7=iEr = idet(X 5 (^ Xj ))^ 
' " 2 det(X 5 )2 



(a) follows by replacing the SVD of X^ 1 . Notice that X 5 = USY 5 and X^ 1 = Y5 1 S~ 1 U T . The 
latter equality holds because Y^ 1 is a square full rank matrix, which is immediate by the assumption 
that rank(Xs) = n. (b) follows by first using a property of matrix norms, and, then, dropping the square 
orthonormal matrix U T and inserting the matrix Y, which has orthonormal rows, to the Frobenius norm 
term, (c) follows by the definition of the Frobenius norm, (d) follows by inserting the identity matrix 
5] _1 U T U5] = I n . (e) follows by applying Cramer's rule to the linear system Ax = b, with A = USY5 
and b = Finally, (/) follows by replacing the appropriate values for X5, Xj, and HS" 1 ^. 

Notice that if X has orthonormal rows then 5] is the identity matrix, and (b) becomes an equality. ■ 



Random Subsets Chosen via Volume Sampling 

Lemma 3.8 connects determinants and the term ||X^||p, for any set S for which X5 has full rank. In 
the related work part of the introduction, we also stated various results for the specific set S C [m] of 
cardinality k > n that maximizes det(X T x£) over all possible 7~'s of cardinality k. Unfortunately, finding 
the maximum volume (determinant) subset is not only NP-hard [44, 19], but also exponentially hard 
to approximate [40, 18], so, these results do not yield an efficient algorithm. We solve this issue using 
randomization. 



Lemma 3.9. Let X £ 

VolSamp(X, k). Then, 



(m > n) be a full rank matrix and let m > k > n. Suppose that S 



E 



IX 



t || 2 
sIIf 



< 



m 



n+1 



k — n + 1 



ixtii 2 
1^ Hf • 



(2) 



(If for every set S £ C ([m],n) the matrix X5 is full rank then this bound becomes an equality). Also, for 
i = l,...,n, 

n(m — k) \ _ 2 



E [af (X*)] < 1 + 



In particular, 



E 



IX 



t || 2 
5II2 



< 1 + 



k — n + 1 

n(m — k) 
k — n + 1 



0~; 
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Proof. For i = 1, 



n. 



let Yj 6 1 ) xm denote the matrix obtained after removing the ith row of X. 



Using the definition of expectation and the equality of Lemma 3. 



E 
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SeC([m],k) 



det(X 5 Xj)||xtj||| w EseC([m],fe),rank(x s )=n det ( X 5 x s)ll x sllF 
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EseC([m],fc) det(X 5 X 5 ) 



Ssec(H,fe) det ( x sX|) 
m],fc),rank(X5)=n 

Er=idet((Y0 5 (Y,)I) 



E 



5eC(H,fc) 



det(X 5 X^ 



In (*), if rank(X,5) 7^ n, then, det(XsX£) = 0, so it can be ignored in the sum. We will now analyze the 
numerator and the denominator of the last relation separately. We start with the denominator. We have 

£ det(X 5 Xj) ^ £ £ det(X T x£) 

5eC(H,fc) 5eC(H,fe) rec(5,n) 



(6) / m 



- /? 
n 



) E de„x r x f , s (-;_-;) 



TeC{[m],n) 



where a\ > 02 > • • • > cr n are the singular values of X. (a) follows by applying the Cauchy-Binet formula. 
(6) follows from observing that each set in (^) is repeated exactly (TZT) times in the sum. (c) follows by 
applying the Cauchy-Binet formula again and the fact that for symmetric positive-definite matrices the 
determinant is equal to the product of the eigenvalues. 
As for the numerator, we have 



J2 ^det((Y i ) 5 (Y i )I) 
SeC(H,fc),rank(x s )=n i=1 



(a) 

< 



(b) 



(c) 



(e) 



x; E det (( Y ^)5(^)i) 

SeC([m],k) i=l 
n 

£ £ det((Y l ) 5 (Y l )I) 

»=i Sec([m],fc) 

n 

EE E det(CYi) r (Y 4 )£) 

i=l <SeC([m],fc)TeC(<S,n-l) 



i=i 



m 
fc- 



- n + 1 
n + 1 



£ det((Y i ) r (Y i )^) 
TeC(M,n-i) 



m 
fc- 

m 
fc- 



- n + 1 
n + 



- n 
n + 



1\ 

]Tdet(Y 4 Y 

' i=i 



t=l j^i 



(a) follows because we are adding only positive terms in the sum. (6) follows by applying the Cauchy-Binet 
formula, (c) follows from observing that each set in C ([m],n — 1) is repeated exactly (™Zn+i) times in 
the sum. (d) follows by applying the Cauchy-Binet formula again. Finally, in (e), the matrices YiYj 
are equal to the matrix obtained by deleting the i-th column and the i-th. row of XX T , so according to 
Lemma 2.4, £?=! det(Y*Y?) = £™=i U^i °l 

We now conclude the first part of the proof as follows, 
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(If Xs is full rank for every S € C ([m],n) then (a) in the previous calculations is an equality.) 

We now prove the bounds for the singular values of X5. Let 7~ be any subset of [m] of cardinality k 
such that X7- has full rank, and let 7" = [m] — T ■ Notice that T has cardinality m — k. Let, 



W 







(m— fc)xfe 



X-y-X-y- 
Im— k 



(Note that X^-X r G R kx ( m ~ k )). Since X r has full rank, we have 

( X T nx(m _ fc ) ) W = ( X r X T 



xn, 



where II £ l mxm is an appropriate permutation matrix. Clearly W is non-singular (it is a triangular 
matrix with a non-zero diagonal), so for i = 1, . . . , n, a simple matrix perturbation argument implies that, 

of (x r ) = of (( x T o nx(m _ fc) )) = of (xnw- 1 ) < ||W||| • of (X) . 



To bound 1 1 1 1 § we observe that, 



|W||| < 1 + ||X r X T ||l < 1 + ||x5-x r |||. 



Now, if S ~ VolSamp(X, k) then only «S's for which X5 is full rank have positive probability of being 
sampled. This implies that for i = 1, . . . , n, 



E [ar 2 (X 5 )] < E 



1 + iixtx 



S^\SIIF 



(X). 



We now bound E 



1 + IIX' X 



S^SllF 



. Let X = USV T be a SVD of X and let us denote Z = V T . Then 



it is easy to verify that X^X^ = Z^Z^. To bound the expected value of HZ^Z^Hp we observe that, 



ztzn 



Z S Z <S Z S Z S 



This implies that ||Z^ZII||p = llZ^Z^IIp+HZ^Z^IIp. Z^Zs is a projection; so, ||Z^Z,s||p = n. We now have 



|z^zn||p 



n 



+ ||ZjyZ5||p. ZII has orthonormal rows; so, ||ZtjZII|| F = ||Z^|||. So, ||Z^Z 



2 

SllF 



\7J II 2 -n 
|Zi 5 || F n. 



Since VolSamp(X, k) = VolSamp(Z, k), the Frobenius norm bound in the lemma guarantees that, 



E 



rt ||2 



< 



n(m — n + 1) 
k — n + 1 



Plugging that into the previous equality, we find that, 



E 



I Z 5 Z sIIf 



< 



This immediately gives a bound on E 



1 + ||X^X s ||p 



n{m — k) 
k-n + 1' 

, which concludes the proof. 



We can now prove the following corollary, which was previously stated as Lemma 3.2. 

Corollary 3.10 (Restatement of Lemma 3.2). Let X G Jj nxm ( m > n ) fre a full rank matrix. There exists 
a subset S C [m] of cardinality m — 1 such that X5 is full rank and 



|xU F < m - n + 1 .||xt||p 

m — n 
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Input: X G M. nxm , parameter n > 0. 
Output: Set S C [m] of cardinality ra. 

l: Let a = (1 + rf) ■ (m - n + 1) • ||Xt|||. 

2: repeat 

3: Apply VolumeSample from [34] to sample a subset S from VolSamp(X, n). 

4: until HX^ 1 !!! < a 

5: return S 



Algorithm 5: A randomized volume-based sampling algorithm for subset selection (Theorem 3.11.) 



Proof. Let T ~ VolSamp(X, m — 1). According to Lemma 3.9 we have 

m — n + 1 



E 



lift II 2 



< 



m — n 



The random variable ||Xt-||p is discrete, so it must assume at least one value larger than the expectancy 
with non-zero probability. Let S be one such set, so 



IX 



t ||2 



< 



m 



n + 1 



m — n 



•l|xt||| 



X5 must be full rank since T assumes it with some non-zero probability, but the distribution VolSamp(X, m- 
1) gives a zero probability to every subset 7Z for which X^ is rank deficient. ■ 

If there exists a subset S of columns of cardinality k such that these columns are linearly dependent 



then (2) might be a strict inequality. For example, let Y 



nxn Onx(m— n) 



) . There is only one set of 



cardinality n that has positive volume (i.e., the set of columns is full rank): T = [re]. Since this is the only 
set with positive probability we have 
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rt ||2 

sIIf 



n 



|Yt|| F < {m-n+l)\\Y 



t \\2 
s\\f 



(m — n + l)n . 



It is interesting to note that there is a discontinuity near matrices of this sort. Let Z be a matrix such 
that Y e = Y + eZ has full rank for every subset T and every e / 0. Obviously such a matrix exist. It is 
easy to see that for e — > we have ||Yj|| F 



lYtll 2 
I 1 Hf 



limE 

e->-0 



eJs\\F 



n. We find that 
(m — n + l)n 7^ n = E 



|Yt||2 



Volume Sampling Subset Selection 

To use Lemma 3.9 in an algorithm one needs a method to sample a subset from VolSamp(X, k). Computing 
the probabilities for all (Tj subsets and sampling according to them is not efficient; there are too many such 
sets. However, this is not necessary, since one can simulate volume sampling using polynomial number 
of operations using recent results from [22, 34]. More precisely, using the algorithm VolumeS ample 
from [34] we can sample a subset S from VolSamp(X,n) using 0(n 3 m) operations. Current determinant- 
based sampling algorithms [22, 34] can sample only k = n columns from X. We leave it as an open 
question whether one can efficiently sample from VolSamp(X, k) for arbitrary k > n. 
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Theorem 3.11. Fix X 6 R 7lXm (m > n, rank(X) = n) and sampling parameter k = n. Choose a 
parameter rj > 0. Algorithm 5 constructs a set S C [m] o/ cardinality k with, 

W^Wf < (1 + v) ■ (m-n + 1) ■ ||Xt|||; HX^H < (1 + rj) • (m - n + 1) ■ n • ||X + |||. 

For every < 6 < 1, the algorithm will terminate after O (mn 3 log (1/5)/ log (1 + 77)) operations with 
probability at least 1 — 5. 

Proof. First of all, we only need to prove the bound for the Frobenius norm. The bounds for the spectral 
norm follow by the fact that for any matrix B of rank n: ||B||| < ||B||p < tt, 1 1 B 1 1 2 - 

Algorithm. We describe Algorithm 5. We start by applying VolumeSample from [34] to sample a 
subset S from VolSamp(X, n) using 0(n 3 m) operations. We then compute ||X^ || F using 0(n 3 ) operations 
and compare it to (1 + 77) • (m — n + 1) ■ ||X^||p. If it is smaller than the bound we return S, otherwise we 
repeat this procedure until we find a satisfactory S. 



Approximation bound. Notice that we repeat the experiment t = 1,2,... times, constructing sets Si, 
S2.... We stop only if we find a satisfactory S, so the bounds are satisfied if the algorithm returns. 



t 112 
sIIf 



< 



Failure probability. Lemma 3.9 indicates that if S is sampled from VolSamp(X, n), then, IE ||X 

(m — n + 1) • ||X^|| F . For the first iteration t = 1, by Markov's inequality, we find that, with probability at 
most 1/(1 +77): ||X^ || F > (I + 77) (m — ?7 + 1) • || X^ ||p. Therefore, for a finite number of iterations i > 1, 

the probability that all t = 1, ...,£, satisfy ||X^ ||p > (1 + 77) (m — ?7 + 1) • ||X^||p is at most 1/(1 + 77)^. So, 
for any < 5 < 1 the probability \ log (1 /5) / log (1 + 77) ] successive iterations to fail is below 5. 

Operation count. Each iteration of the algorithm takes 0(n 3 m). In total, for any < 5 < 1 the 
operation count is O (mn 3 log (1/5)/ log (1 + 77)) with probability 1 — 5. ■ 



4 Lower Bounds 

4.1 Spectral norm subset selection 

Theorem 4.1 (Spectral Norm). For any a>0,n>0,m>2 with m > n, and k with n < k < m, there 
is a full rank n x m matrix X such that, for any subset S C [777] of cardinality k > n with rank(Xs) = n, 

, t 2 (m + a 2 \ t 2 



IPM^T^-iJ-ll*!.. 

Proof. We construct the matrix X as follows. Let A = [ei + ae2, ei + ae^, . . . , ei + ae m+ i] £ ^("M-ijxm^ 
and let A = U5]V T be the SVD decomposition of A. X is the first 77 rows of V T . 

To prove the bound we use Theorem 17 and Lemma 7 from [10]. Theorem 17 shows that if m > 2, 
then, for A defined above, for every subset S C [777] of cardinality k, we have 



|A A^AjjA)^ — — — ^2 ' 11-^- A-nlli: 



where A n is the best rank 77 approximation to A. Lemma 7 proves that for any matrix W S M. dxm , rank 
parameter 77 < rank(W), and sampling parameter n < k < m, for every subset S C [777] of cardinality k 
for which Z5 (Z is to be defined shortly) has full rank, we have 

|| W - W 5 W^W||2 < ||W - w„||l + 1| (w - W n ) s ||2 < (l + ||ZJ||1) • ||W - w n ||i . 
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Here, W n is the best rank n approximation to W. Z is denned as follows. Let W = U£!V T be the SVD 
of W. Z is the first n rows of V T . 

We now apply Lemma 7 from [10] on A and X and combine it with the bound from Theorem 17 of [10] 
mentioned above, to find that 

t > " A r A *4 A IH _ ! = ^ _ ! = _ A \\xHl 



5112 - ||A-A n ||2 k + a 2 \k + a 2 J 11 ll2 ' 

■ 

As a —7- 0, the bound in the above theorem is m/k — 1. If k = (1 + f2(l))n then the upper bound of the 
deterministic algorithm of Theorem 3.5 asymptotically matches this lower bound. The upper bounds of 
the algorithms in the Theorems 3.1, 3.7, and 3.11 and Corollary 3.3 are - asymptotically - slightly worse. 
However, if k = (1 + o(l))n there is a gap between the lower bound and the best upper bound. 

4.2 Probenius norm subset selection 

Theorem 4.2 (Frobenius Norm). For any a > 0, n, m with m > n, mod (m,n) = 0, and m/n > 2, and 
k with n < k < m, there is a full rank n x m matrix X such that, for any subset S C [m] of cardinality 
k >n with rank(Xs) = n, we have 

llxt|| 2 >f^4 + i-^Vllxt|| 2 



Proof. We construct the matrix X as follows. Consider a block diagonal matrix B E M. dxm : a matrix 

A E M,d/nxm/n of the form that 

appear in the proof of Theorem 4.1 is repeated n times on B's main 
diagonal. Let B = USV T is the SVD of B. X is the first n rows of V T . 

To prove the bound we use Theorem 19 and Lemma 7 from [10]. Theorem 19 in [10] indicates that for 
the above B, any n < rank(B), and k > n, any subset S of k columns of B satisfies, 

B - B5B5B p = 1 + ——3 • B - B n 2 . 

m — n \ k + cr / 



Using 0-1 (B) = ct 2 (B) = • • • = ci n (B) = n + a 2 ; a n+1 (B) = cr n+2 (B) = • • • = a m = a 2 ; ||B - B 
and ||B — B n ||p = (m — n)a 2 , we obtain, 



|2 _ ^,2. 
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a 



|B - B^B^BH 2 = (m — k) ■ 11 + — - • ||B - B n || 2 . 



A; + a' 2 

Now, Lemma 7 of [10] implies that, for any matrix W E M. dxm , rank parameter n < rank(W), and 
sampling parameter n < k < m, for any S of cardinality k, if Z5 has full rank (Z is defined shortly), 

|| W _ WcjW^WH 2 < ||W - W n || 2 + II (w - w n ) s z^w 2 . 

Here, W n is the best rank n approximation to W. Z is defined as follows. Let W = U£!V T be the SVD 
of W. Z is the first n rows of V T . Applying spectral submultiplicativity to this relation we obtain, 

||w - w^wJavii 2 < ||W - W n || 2 + ||ZJ;|| 2 • ||W - W n || 2 . 

We now apply Lemma 7 from [10] on B and X and combine it with the bound from Theorem 19 of [10], 

ll Y ti.2 >> llW-WgW^WHl ||W-W W || 2 / n \ ( m-k k\ f 2 

An \ t > r, ^ttt — — — t, rr4r = (m — k )■{ 1 H ^ — (m — n ) = tt + 1 • X t?- 

11 5llF - ||W-W n ||2 ||W-W n ||2 v ! \ k + a 2 ) K ' \k + a 2 nj n " F 
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As a — > and k = O (n), this bound is m/k — 0(1). If k = (1 + f^(l))n the Frobenius norm bounds in 
Theorems 3.1 and 3.5 asymptotically match this lower bound. However, if k = (1 + o(l))n there is a gap 
between the lower bound and the best upper bound. There is also a gap when k = ui(n). We believe that 
the gap for k = co(n) is the result of looseness in the lower bound, but we were unable to prove a tighter 
bound than Theorem 4.2. 

5 Low-stretch Spanning Trees and Subset Selection 

Let G = (V, E, w) be a weighted undirected connected graph. Unless otherwise stated, in this section, we 
denote the number of vertices of G by n, and the number of edges by m. Let T = (V, F, w) be a spanning 
tree of G, where F is a subset of E having exactly re — 1 edges. We use the same weight function w because 
the edges in T have the same weights with the corresponding edges in G. Since T is a tree, every pair of 
vertices in T is connected by a unique path in T. For any edge e G E, let us denote by pr(e) the set of 
edges on the unique path in T between the incident vertices of e. The stretch of e with respect to T is 
St-r(e) = Yle'&p T (e) • The stretch of the graph G with respect to T is [2] 

Str(G) = J^Str(e). 

e€E 

The problem of finding a low-stretch spanning tree is the problem of finding a spanning tree T of G such 
that Stj'(G) is minimized, among all possible spanning trees of G. Let St(re) = maxG<=G„ min^ St*r(G), 
where G n is the family of graphs with n vertices. The following bounds are known: St(ra) = J7(mlogn) [2]; 
St(n) = 0(m log n • log log n • (log log log n) 3 ) [1]. In this section we show that finding a low-stretch 
spanning tree is in fact an instance of the Frobenius norm version of Problem 1.1. 

Finding a low stretch spanning tree has quite a few uses. One important application is the solution of 
symmetric diagonally dominant (SDD) linear systems of equations. Boman and Hendrickson [6] were the 
first to suggest the use of low-stretch spanning trees to build preconditioners for SDD matrices. Spielman 
and Teng [51] later showed how to use low stetch spanning trees to solve SDD systems using a nearly 
linear amount of operations. The currently most efficient algorithm for solving SDD systems [41] uses a 
low stretch spanning tree as well. One of the many obstacles in generalizing these algorithms for wider 
classes of matrices (e.g., finite-element matrices) is the lack of an equivalent concept, like the stretch, for 
such matrices. By studying the purely linear-algebraic nature of the low-stretch spanning tree problem 
(i.e. the Frobenius norm version of Problem 1.1), our hope is to glean new insights on how to generalize 
the concept of low-stretch trees, or to substitute it with something else. 

Other applications of low-stretch spanning trees include: Alon-Karp-Peleg-West game, MCT approxi- 
mation and message-passing model. See [26] for details. 

Next, we show that finding a low-stretch spanning tree is an instance of subset selection. We first 
relate graphs to matrices. 

Definition 5.1 (Edge-vertex incidence matrix/Laplacian matrix). Let G = (V,E,w) be a weighted undi- 
rected graph. Assume, without loss of generality, that V = {1, 2, . . . , n}; E = {(u\, v\), (u2,V2), ■ ■ ■ , (u m , v m )} 
The edge-vertex incidence matrix of G is Uq £ M nxm , where column i of Uq is y w(iti,Vi)(e Ui — e Vi ). 
Here e±, e2, . . . , e n are the identity (standard basis) vectors. The Laplacian matrix of G is Lg = TIgTIq. 

Every column in represents an edge in G. A spanning tree T is a group of edges that span G and 
form a graph. The set of edges in T correspond to a set of columns in lie which we denote by S(T). 
Notice that if the indices are kept consistently, then, Ily = (Ug)s(t)- If «5 C [m] is a subset of columns, 
then, there is a subgraph H of G that contains the edges corresponding to the columns in S. We denote 
this subgraph by H(S). We are now ready to connect low-stretch spanning trees and subset selection. 
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Theorem 5.2. Let G be a weighted undirected connected graph. Let TIq = USV T be the SVD of Hq 

with U G Rnxfr-i), £ G R(«-i)x(n-i) and v G R mx(n-i) (q is connecte d ; S0 , rank(n G ) =n-l). For 

notational convenience, let Y = V T . 

1. If T is a spanning tree of G, then, St^(G) = ||Y^ T J|p. 

2. If S C [m] has cardinality n — 1 and Y5 has full rank, then, H(S) is a spanning tree of G. 

Proof. To prove the first part of Theorem 5.2, we need a result of Spielman and Woo [50], who recently 
connected the stretch of G with respect to T to the matrix LgL^. More precisely, Theorem 2.1 in [50] 
shows that if T is a spanning tree then St T (G) = Tr (l g l{,) . Here, G is a weighted undirected connected 
graph and T is a spanning tree of G. 

Let us denote S = S{T). Since T is a tree we have rank(n^) = n — 1 (it is well known that the edge 
incidence matrix of a connected graph has rank \V\ — 1). Since 11^ = (TIg)s = USIY5, Y5 must be full 
rank. Now, 

Str(G) ( = } Tr (l g 4) i Tr (u G Ul ((U G ) S (U G )^) ^ Tr (uS 2 U T (U£Y 5 YjSU T )t) 

^ Tr (U£ 2 U T U5r 1 (Y 5 Y£) E _1 U T 
( ^ TrfsrY^)- 1 ^ 1 



^ ^((YsYj)- 1 

— IIV" 1 ll 2 

— II 1 «S(T) llF " 

(a) follows by the Spielman- Woo result. (6) follows by replacing the Laplacian matrices with the product of 
their edge-incidence matrices, (c) follows by introducing the SVD of and the equality (n^s = USY5. 
(d) follows since all three matrices involved (U, Y5, and S) are full rank, (e) follows since U has 
orthonormal columns. (/) follows since Tr(AB) = Tr(BA). 

We now prove the second part of the theorem. For H(S) to be a spanning tree, it has to be a connected 
graph with n — 1 edges. The last condition is met since S has cardinality n — 1, and the number of edges 
in H{S) is equal to the cardinality of S. As for connectivity, notice that Il H r s \ = (Uc) s = USY5. Now, 
since Y5 has full rank, we have rank(Il^(5)) = |V| — 1. This directly implies that H(S) is connected. ■ 

The algorithms that we presented in Theorems 3.1 and 3.11 in Section 3 can be used to find a low-stretch 
spanning tree (run these algorithms on the matrix Y of the above theorem), but they are not competitive 
both in terms of operation count and in terms of approximation bounds. Both these algorithms can 
guarantee Sfrr(£r) < (n — l)(m — n + 2). The operation count is (m 2 n) and 0(mn 3 ), respectively. This 
upper bound also holds for the easily computable maximum weight spanning tree. Koutis et al. describe 
in [41] an algorithm which gives the available state-of-the-art upper bound St^(G) < 0(mlogn • log log n ■ 
(log log log n) 3 ), and has operation count of 0(m log(n) +nlog(n) log log(n)). The main reason for this gap 
is that our algorithms are designed for general matrices, while [1, 41] describe a graph algorithm, which 
better exploits the unique structure of the problem. Nevertheless, when reinterpreting our algorithms as 
algorithms for constructing low-stretch spanning trees yields interesting connections that sheds light on 
both problems, as we discuss below. 

5.1 Low-stretch spanning trees via the greedy removal algorithm 

Theorem 5.2 along with Theorem 3.1 suggest a greedy removal algorithm for constructing a spanning tree 
with low stretch: start with a full set of edges H = E; then, at each iteration, find the edge e such that 
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|| ± s , H _ry. Hp is minimized, and set H < — H — {e}. Finish once H has n — 1 edges. That is, we apply the 
algorithm of Theorem 3.1 on Y (see Theorem 5.2 for the definition of Y). We note that this algorithm 
is different from the natural greedy removal algorithm, which would remove edges to keep the stretch of 
the subgraph minimal in each step. It is possible to define the stretch St# (G) of a subgraph H; we refer 
the reader to chapter 18 of [46] for the definition. It is also possible to show that for a spanning subgraph 
H, llY^^llp < Stff(G) (we omit the proof), but an equality does not hold. In fact, our algorithmic 

results imply that it is possible to find a subgraph with 0(n) edges such that ||Y^^||p = 0(m), but 
there exists a graph for which every subgraph H of 0{n) edges we have St#(G) = f2(mlog(n)) (Corollary 
18.1.5 in [46]). 

We conducted some simple experiments with our greedy removal algorithm. In the first experiment, we 
used greedy removal to generate a spanning tree T n of the complete graph K n on n with equal weights ver- 
tices, for n = 10, 11, . . . , 50. We then computed the stretch of T n . We found that StT n (K n ) « 0.6m log 2 n. 
We then repeated this experiment with random weights on the edges of K n . We found that in almost all 
runs, StT n {K n ) 0.3m log 2 n. These values are much better than our theoretical bounds, and are closer 
to what it is possible to find using state-of-the-art algorithms for low-stretch trees. These experiments, 
although far from exhaustive, suggest that our theoretical worst-case upper bounds for greedy removal are 
rather pessimistic for the matrices relevant to finding a low-stretch spanning tree. 

5.2 Maximum weight spanning trees and maximum volume subsets 

The volume corresponding to a set S has a very natural interpretation when S is a subset of columns in 
n G , and it corresponds to a tree. Let II G = USV T be the SVD of II G (U £ R«x(n-i) ) s G R (n-i)x(n-i) ( 
and V £ flj mx ( n_1 ); Q i s connected so rank(II G ) = n — 1). For notational convenience, let Y = V T . 
Define U to be the first n — 1 rows of U; U is a square matrix). Define n G to be the first n — 1 rows of 
II G . Notice that n G = USY. This implies that <Sfc(Y) = 5^(11^) and also that the set S that maximizes 
det(Ys) 2 also maximizes det((n G ),s) 2 . 

Let T be a spanning tree of G. Determinants of the form det((n G )s(y)(f[ G )£p^) have a very natural 

interpretation. The matrix (YIg)s(t)(X^g)s(^t) 1S a Laplacian of a graph for which a column and row of 
some vertex have been removed. It is well known that the determinant of such matrices, when the graph 
is a tree, is equal to the product of the weights of the tree edges. That is, 

det((n G ) s(T) (n G )J (T) ) = J] w{e) . 

The subset of columns S that maximizes the volume also maximizes rieeff(cS) w ( e )- This trivially 
implies that the corresponding tree is a maximum weight spanning tree. So, for edge-incidence matrices 
one can use an efficient maximum weight spanning tree algorithm to find the maximum volume subset of 
columns efficiently. The bound we obtain is Stx(G) < (n — l)(m — n + 2). We are unaware of any other 
analysis of the stretch of a maximum weight spanning tree, but this bound can be easily proven using 
much simpler arguments. 

5.3 Low-stretch spanning trees via volume sampling 

Recall Problem 1.1, and let the input matrix be the matrix Y from Theorem 5.2. Using volume sampling 
(Lemma 3.9) to sample a subset of columns from this Y corresponds to sampling a random spanning tree, 
where a tree T is sampled with relative probability Yl e€ T w ( e )- We denote this probability distribution 
on spanning trees of G by T(G). Lemma 3.9 provides a bound on the stretch of a random spanning tree 
sampled from T(G). Notice that it is a strict upper bound. The reason is that not every subgraph H is a 
tree. We conjecture that this bound is pessimistic, and leave for future work the refinement of the bound. 
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Corollary 5.3. Let G be a weighted undirected connected graph, and let T be a random spanning tree, 
where tree T is sampled with relative probability Y\ ee x w ( e )- Then, 

E [St r (G)] < (n - l)(m - n + 2) . 

One can use VolumeSample from [34] to generate such a spanning tree in 0(n 3 m) operations. How- 
ever, the problem of generating a sample from T(G) is a well studied problem, and there exists algorithms 
that can generate a random spanning tree faster than 0(n 3 m). See [56] for a short review. 

5.4 Towards better bounds for low-stretch spanning trees 

State-of-the-art algorithms for finding low stretch spanning trees attain theoretical worst-case bounds that 
are better than the ones we obtain for a general matrix. We now provide a preliminary explanation for 
this gap. 

Consider a matrix X £ R nxm with orthonormal rows such that for every subset S C [m] of cardinality 
n, det(Xs) 2 = or det(Xs) 2 = C, for some constant C. The second inequality of Lemma 3.8 is 

||x _ 1||2 = S^iS^idetCX^xj)) 2 
5 " F det(X 5 ) 2 

(here it is an equality because X is orthonormal; also, ||XT||| = 1). Since all determinants are or C, we 
find that 

11X^111 = #{T : rank(X r ) = n,T = (S -{t})U{j} for ie S, j £ [m]} . 

That is, for a subset S such that the columns of X in S form a basis for the column space of X, HX^Hp 
is equal to the number of bases that can be obtained by replacing a single column. The last quantity can 
only be bounded universally by n(m — n + 1), and that quantity is obtained for all 5s if X7- has full rank 
for every subset T ■ However, if there exist at least one subset T for which X7- is singular then there is a 
subset S for which the n(m — n + 1) bound is strict. If many such sets exist, then, the bound is probably 
very loose. 

Now, let us consider the incidence matrix of a complete graph with equal weights. If for a subset S the 
subgraph H(S) is not a tree, then, det(Yg) = (Y is defined in Theorem 5.2) . Every tree has exactly 
the same weight, so for all <S's that correspond to trees we have the same det(Ys) 2 . We see that Y falls 
into the case discussed above. We conclude that the reason that Y has a subset of column S for which 
|| Y^ 1 ||p is small is the fact that for some subsets T the matrix Y7- does not have full rank. In fact, for 
the complete graph, most cardinality n — 1 subsets of edges will not result in a tree or a full rank Y5. If 
we could enumerate these subsets exactly, this should give a better upper bound for this special matrix. 

6 Other Applications 

6.1 Column-Based Low-Rank Matrix Reconstruction 

Suppose we want to build a low rank approximation of A £ IR dxm . For a rank parameter r < rank(A), let 
A r £ ]j rfxm denote the best rank r approximation to A. That is, A r minimizes ||A — B 1 1 2 , over B, where B 
ranges on all rank r dxm matrices. It is well known that A r can be computed via the SVD of A. However, 
SVD uses all the columns of A to compute A r . In some applications it is desirable to use only a small set 
of columns to build the low rank approximation (see [10] and references there in for such applications). 
Let S C [m] and A5 contains a subset of columns of A indicated in S. Define Hs iT (A.) £ ]^ dxm ^ Q De 
the best rank r approximation of A within the columns space of As, with respect to the spectral norm 
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(if S = [m], then lis.,. (A) = A r ). The so-called column-based low-rank matrix reconstruction problem is: 
given A, r < rank(A), and an sampling parameter k > r, find a subset S of cardinality at most k such 
that || A — IIs r (A)||2 is minimized among all the possible choices for the subset S. 

It is natural to evaluate IIs^A) in terms of A r . That is, provide approximation bounds of the form 
||A — IL5 ]r (A)|| 2 < a ■ ||A — A r H2- Currently, the best deterministic such algorithms are available in [10]. 
These algorithms achieve asymptotically optimal upper bounds, but there is still room for improvement 
in terms of lowering the the operation count. 

The algorithm of Corollary 3.3 can be used to obtain a new deterministic algorithm for the column- 
based low-rank matrix reconstruction problem. First, construct an SVD decomposition A = USV T . Let 
X £ ]j rxm be the first r rows ofV T . We now use the algorithm of Corollary 3.3 on X to generate a subset 
S C [m] of size k, which is the result of the algorithm. The following bound holds, 

||A - n 5ir (A)|| 2 < J 2 + r } m ~ k \ ■ II A - A r || 2 . 
V k — r + 1 

We omit the proof, which follows by combining Lemma 7 from [9] and Corollary 3.3. The algorithm is 
deterministic and the operation count is T$vd + 0(mr{m — k)), where T$vd is the number of operations 
needed to compute the top r right singular vectors of A. Our approximation bound is slightly worse than 
the bounds in [10] but bound on the number of operations can sometimes be better, depending on the size 
of the input matrix. We refer the interested reader to [10] to conduct her own comparison. 



6.2 Sparse Solutions to Least-squares Regression Problems 

Fix inputs A £ K tlxm and b £ consider the following least-squares problem, min xe Rm ||Ax — t>|| 2 - 
Since there is no assumption on d and m, or that A is full rank, the minimizer of || Ax — b|| 2 might not be 
unique; there might be a full subspace of minimizers. Even if there is a unique minimizer, it might have a 
huge norm, while there exists an almost-minimizer with small norm. It depends on the application what 
exactly is needed, but often some kind of regularization is used to address the issues just mentioned. One 
popular regularization technique is truncated SVD [35]: for r < rank(A), let A r £ flj dxm Q f rank r denote 
the rank-r SVD of A; then, the truncated SVD regularized solution is given by x s „d(r) = Ajb £ R m . 

However, sometimes a different regularization is sought: requiring the solution vector to be sparse. 
That is, we are interested in constructing a vector x& £ M. m that has at most k non-zeros, for some k. 
Since truncated SVD is arguably the most natural regularizer, it makes sense to compare x^ to x svd ^ for 
some r < k. More specifically, we are interested in bounds of the form, 

|| Ax fc - b|| 2 < ||Ax S7jd ( r ) - b|| 2 + a . 

The idea of obtaining sparse solutions with approximation bounds of the above type can be traced to [15]. 
Currently, the best deterministic method is in [8] (k > r) with a = (l + ||b|| 2 ||A - A r \\ F /a r (A). 

The algorithm of Corollary 3.3 can be used to design a new deterministic algorithm. First, construct 
an SVD decomposition A = UX!V T . Let X £ M rxm be the first r rows of V T . We now use the algorithm 
of Corollary 3.3 on X to generate a subset S C [m] of size k. We now compute x^ = A^b £ We 
form Xfc by spreading the indices of x^ to their corresponding place in x& and putting elsewhere. The 
following bound holds, 



r(m — k) \ .,, „ cT r +i(A) 
|Ax fe - b|| 2 < \\Ax svd(r) - b|| 2 + 1 + d y [ ||b|| 2 - 



k-r + lj" 112 o>(A) 

We omit the proof since it follows immediately by combining Lemma 3 from [8] with Corollary 3.3. The 
algorithm is deterministic and the operation count is 0(dm min{<i, m} + mr(m — k)). 
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Our bound essentially contains the term \Jm — k ■ a r+ \(A) in place of the term ||A — A r ||p in the 
bound of [8]. It is always the case that ||A — A. r ||p < y/m — r ■ <7 r+ i(A), but since k > r, our bound might 
be better in some cases (e.g. when k — > m). 

6.3 Feature Selection in fc-Means Clustering 

The deterministic algorithm of Corollary 3.3 can also be used for deterministic feature selection in fc-means 
clustering. We refer the reader to [11] for an introduction to this problem. Theorem 4 of [11] gives such 
a polynomial deterministic unsupervised feature selection algorithm, which selects features from the data 
and then rescales them. Using Corollary 3.3, one can design a deterministic unsupervised feature selection 
algorithm without rescaling. We omit the details, since the algorithm is similar to the one described for 
sparse least squares, and the analysis is a combination of Lemma 10 from [11] with Corollary 3.3. The 
approximation bound that is obtained is 0(r{m — k) /{k — r + 1)); here, m is the number of features in the 
input points, r is the number of clusters, and k is the number of features after feature selection. 

7 Open Problems and Future Directions 

Several interesting questions remain unanswered and we leave them for future investigation. First, is the 
Frobenius-norm version of Problem 1.1 NP-hard? Second, is it possible to close the existing gaps between 
lower and upper bounds for Problem 1.1? Third, is it possible to extend the Strong Rank Revealing 
QR method of [32] to sample arbitrary k > n columns? Fourth, is it possible to extend the polynomial 
implementations of volume sampling in [22, 34] to sample arbitrary number of columns from short-fat 
matrices? Finally, is it possible to derandomize the algorithm of Theorem 3.11? 
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